library(data.table)
pm25_2004<-fread('ad_viz_plotval_data_2004.csv')
pm25_2019<-fread('ad_viz_plotval_data_2019.csv')
dim(pm25_2004);dim(pm25_2019)
## [1] 19233 20
## [1] 53156 20
colnames(pm25_2004);colnames(pm25_2019)
## [1] "Date" "Source"
## [3] "Site ID" "POC"
## [5] "Daily Mean PM2.5 Concentration" "UNITS"
## [7] "DAILY_AQI_VALUE" "Site Name"
## [9] "DAILY_OBS_COUNT" "PERCENT_COMPLETE"
## [11] "AQS_PARAMETER_CODE" "AQS_PARAMETER_DESC"
## [13] "CBSA_CODE" "CBSA_NAME"
## [15] "STATE_CODE" "STATE"
## [17] "COUNTY_CODE" "COUNTY"
## [19] "SITE_LATITUDE" "SITE_LONGITUDE"
## [1] "Date" "Source"
## [3] "Site ID" "POC"
## [5] "Daily Mean PM2.5 Concentration" "UNITS"
## [7] "DAILY_AQI_VALUE" "Site Name"
## [9] "DAILY_OBS_COUNT" "PERCENT_COMPLETE"
## [11] "AQS_PARAMETER_CODE" "AQS_PARAMETER_DESC"
## [13] "CBSA_CODE" "CBSA_NAME"
## [15] "STATE_CODE" "STATE"
## [17] "COUNTY_CODE" "COUNTY"
## [19] "SITE_LATITUDE" "SITE_LONGITUDE"
sapply(pm25_2004,class);sapply(pm25_2019,class)
## Date Source
## "character" "character"
## Site ID POC
## "integer" "integer"
## Daily Mean PM2.5 Concentration UNITS
## "numeric" "character"
## DAILY_AQI_VALUE Site Name
## "integer" "character"
## DAILY_OBS_COUNT PERCENT_COMPLETE
## "integer" "numeric"
## AQS_PARAMETER_CODE AQS_PARAMETER_DESC
## "integer" "character"
## CBSA_CODE CBSA_NAME
## "integer" "character"
## STATE_CODE STATE
## "integer" "character"
## COUNTY_CODE COUNTY
## "integer" "character"
## SITE_LATITUDE SITE_LONGITUDE
## "numeric" "numeric"
## Date Source
## "character" "character"
## Site ID POC
## "integer" "integer"
## Daily Mean PM2.5 Concentration UNITS
## "numeric" "character"
## DAILY_AQI_VALUE Site Name
## "integer" "character"
## DAILY_OBS_COUNT PERCENT_COMPLETE
## "integer" "numeric"
## AQS_PARAMETER_CODE AQS_PARAMETER_DESC
## "integer" "character"
## CBSA_CODE CBSA_NAME
## "integer" "character"
## STATE_CODE STATE
## "integer" "character"
## COUNTY_CODE COUNTY
## "integer" "character"
## SITE_LATITUDE SITE_LONGITUDE
## "numeric" "numeric"
summary(pm25_2004$`Daily Mean PM2.5 Concentration`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.10 6.00 10.10 13.13 16.30 251.00
summary(pm25_2019$`Daily Mean PM2.5 Concentration`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.200 4.000 6.500 7.739 9.900 120.900
As a result, the 2004 dataset includes 19233 rows and 20 columns and the 2019 dataset includes 53156 rows and 20 columns. The colnames names and type of the columns are both shown above. Besides, the interested variable, daily mean PM2.5 concentration are summaried as mean, median, 1th and 3rd quantile, minimum and maximum. No NAs are existed in the Daily Mean PM2.5 Concentration variable. However, some negative values are existed, which are abnomal due to concentration of PM2.5 cannot be negative.
pm25<-rbind(pm25_2004,pm25_2019)
pm25$Year<-rep(c(2004,2019),c(nrow(pm25_2004),nrow(pm25_2019)))
head(pm25)
## Date Source Site ID POC Daily Mean PM2.5 Concentration UNITS
## 1: 01/01/2004 AQS 60010007 1 11.0 ug/m3 LC
## 2: 01/02/2004 AQS 60010007 1 12.2 ug/m3 LC
## 3: 01/03/2004 AQS 60010007 1 16.5 ug/m3 LC
## 4: 01/04/2004 AQS 60010007 1 19.5 ug/m3 LC
## 5: 01/05/2004 AQS 60010007 1 11.5 ug/m3 LC
## 6: 01/06/2004 AQS 60010007 1 32.5 ug/m3 LC
## DAILY_AQI_VALUE Site Name DAILY_OBS_COUNT PERCENT_COMPLETE
## 1: 46 Livermore 1 100
## 2: 51 Livermore 1 100
## 3: 60 Livermore 1 100
## 4: 67 Livermore 1 100
## 5: 48 Livermore 1 100
## 6: 94 Livermore 1 100
## AQS_PARAMETER_CODE AQS_PARAMETER_DESC CBSA_CODE
## 1: 88502 Acceptable PM2.5 AQI & Speciation Mass 41860
## 2: 88502 Acceptable PM2.5 AQI & Speciation Mass 41860
## 3: 88502 Acceptable PM2.5 AQI & Speciation Mass 41860
## 4: 88502 Acceptable PM2.5 AQI & Speciation Mass 41860
## 5: 88502 Acceptable PM2.5 AQI & Speciation Mass 41860
## 6: 88502 Acceptable PM2.5 AQI & Speciation Mass 41860
## CBSA_NAME STATE_CODE STATE COUNTY_CODE COUNTY
## 1: San Francisco-Oakland-Hayward, CA 6 California 1 Alameda
## 2: San Francisco-Oakland-Hayward, CA 6 California 1 Alameda
## 3: San Francisco-Oakland-Hayward, CA 6 California 1 Alameda
## 4: San Francisco-Oakland-Hayward, CA 6 California 1 Alameda
## 5: San Francisco-Oakland-Hayward, CA 6 California 1 Alameda
## 6: San Francisco-Oakland-Hayward, CA 6 California 1 Alameda
## SITE_LATITUDE SITE_LONGITUDE Year
## 1: 37.68753 -121.7842 2004
## 2: 37.68753 -121.7842 2004
## 3: 37.68753 -121.7842 2004
## 4: 37.68753 -121.7842 2004
## 5: 37.68753 -121.7842 2004
## 6: 37.68753 -121.7842 2004
library(leaflet)
leaflet(data=pm25)%>%
addTiles()%>%
addCircles(~SITE_LONGITUDE, ~SITE_LATITUDE,
color = ~ifelse(Year==2004,'blue','red'))
In this figure, the blue points represent the locations of the sites for 2004 and the red points represent the locations of the sites for 2019. According to the map, it reveals that the number of sites increase obviously from 2004 to 2019. Besides, the sites for 2019 are almost located at coastal.
As shown in Q1, it is found that no NAs are existed in the variable Daily Mean PM2.5 Concentration, but some negative values are existed in the variable. Thus the proportion of the negative values are shown as:
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::between() masks data.table::between()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::first() masks data.table::first()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::last() masks data.table::last()
## ✖ purrr::transpose() masks data.table::transpose()
pm25%>%mutate(neg_value=`Daily Mean PM2.5 Concentration`<0)%>%
group_by(neg_value)%>%summarise(prop=n()/nrow(pm25))
## # A tibble: 2 × 2
## neg_value prop
## <lgl> <dbl>
## 1 FALSE 0.996
## 2 TRUE 0.00391
As a result, it indicates that only 0.391% values of Daily Mean PM2.5 Concentration are negative, which may caused by some mistakes. To investigate the temporal pattern, the date of negative values are shown as:
pm25%>%filter(`Daily Mean PM2.5 Concentration`<0)%>%
mutate(month=substr(Date,1,2))%>%
ggplot(aes(month))+geom_bar()
As a result, it is revealed that the negative values are often appeares in March and April, but negative in November is lowest.
pm25%>%ggplot(aes(`Daily Mean PM2.5 Concentration`))+geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Firstly, the distribution of the PM2.5 concentration is investigated. As a result, the distribution is right skewed, which indicates that log-transformation is suitable here.
pm25%>%filter(`Daily Mean PM2.5 Concentration`>0)%>%
group_by(Year)%>%
summarise(log_pm25_mean=mean(log(`Daily Mean PM2.5 Concentration`)),
log_pm25_sd=sd(log(`Daily Mean PM2.5 Concentration`)))
## # A tibble: 2 × 3
## Year log_pm25_mean log_pm25_sd
## <dbl> <dbl> <dbl>
## 1 2004 2.27 0.840
## 2 2019 1.81 0.750
pm25%>%filter(`Daily Mean PM2.5 Concentration`>0)%>%
mutate(Year=as.factor(Year))%>%
ggplot(aes(x=Year,y=`Daily Mean PM2.5 Concentration`))+
geom_boxplot()+scale_y_log10()
As the boxplot log-transformed shown, the level of PM2.5 is decrease from 2004 to 2019, which is also validated by the averages and standard deviances of log(PM2.5) calculated as summary statistic.
pm25%>%filter(`Daily Mean PM2.5 Concentration`>0)%>%
group_by(Year,COUNTY)%>%
summarise(log_pm25_mean=mean(log(`Daily Mean PM2.5 Concentration`)))%>%
pivot_wider(names_from = 'Year',
values_from=log_pm25_mean)%>%
mutate(diff=`2019`-`2004`)
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.
## # A tibble: 51 × 4
## COUNTY `2004` `2019` diff
## <chr> <dbl> <dbl> <dbl>
## 1 Alameda 2.22 1.83 -0.385
## 2 Butte 2.01 1.65 -0.366
## 3 Calaveras 1.77 1.58 -0.191
## 4 Colusa 1.97 1.69 -0.279
## 5 Contra Costa 2.27 1.82 -0.452
## 6 Del Norte 0.954 1.35 0.393
## 7 El Dorado 0.875 0.723 -0.152
## 8 Fresno 2.36 1.86 -0.500
## 9 Humboldt 1.87 1.74 -0.131
## 10 Imperial 2.24 2.11 -0.128
## # … with 41 more rows
pm25%>%filter(`Daily Mean PM2.5 Concentration`>0)%>%
mutate(Year=as.factor(Year))%>%
group_by(Year,COUNTY)%>%
summarise(log_pm25_mean=mean(log(`Daily Mean PM2.5 Concentration`)))%>%
ggplot(aes(y=COUNTY,x=log_pm25_mean,colour=Year))+
geom_point()
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.
As shown in the scatter plot, the points representing 2004 are almost located right of points representing 2019, which indicates that the PM2.5 for 2019 is lower than 2004 for most of counties. Besides, difference of log_PM2.5 between 2004 and 2019 are also calculated as summary statistic, which are almost lower than 0, which also indicates that for almost counties, PM2.5 in 2019 is lower than 2004.
pm25%>%filter(`Daily Mean PM2.5 Concentration`>0)%>%
group_by(Year,`Site Name`)%>%
summarise(log_pm25_mean=mean(log(`Daily Mean PM2.5 Concentration`)))%>%
pivot_wider(names_from = 'Year',
values_from=log_pm25_mean)%>%
na.omit()%>%
filter(`Site Name`!='')
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.
## # A tibble: 77 × 3
## `Site Name` `2004` `2019`
## <chr> <dbl> <dbl>
## 1 Anaheim 2.71 2.08
## 2 Aqua Tibia Wilderness 1.68 1.18
## 3 Azusa 2.74 2.06
## 4 Bakersfield-Airport (Planz) 2.62 2.35
## 5 Bakersfield-California 2.81 2.25
## 6 Bakersfield-Golden / M St 2.83 2.29
## 7 Big Bear 2.08 1.38
## 8 Bliss SP 0.787 0.431
## 9 Brawley-220 Main Street 2.07 2.01
## 10 Calexico-Ethel Street 2.43 2.15
## # … with 67 more rows
Similarly in counties level, the average log_PM2.5 is calculated as summary statistic. To make the comparison reasonable, only sites with observations in both 2004 and 2019 are retained. Due to the number of sites are too many, thus heat map is used rather than scatter plot:
pm25%>%filter(`Daily Mean PM2.5 Concentration`>0)%>%
group_by(Year,`Site Name`)%>%
summarise(log_pm25_mean=mean(log(`Daily Mean PM2.5 Concentration`)))%>%
pivot_wider(names_from = 'Year',
values_from=log_pm25_mean)%>%
na.omit()%>%
filter(`Site Name`!='')%>%
pivot_longer(cols=c(`2004`,`2019`),values_to='log_PM2.5')%>%
ggplot(aes(x=name,y=`Site Name`,fill=`log_PM2.5`))+
geom_tile()+
scale_fill_gradient(low='white',high='black')
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.
As a result, the tiles representing 2019 is light than 2004, which indicates that PM2.5 in 2004 is higher than 2019.
Summary, daily concentrations of PM25 has decreased from 2004 to 2019.